OcrV1, Main, Exploration, bibRecord, 000216

Google Books Ngrams Recompressed and Searchable

Internal identifier: 000216 (Main/Exploration); previous: 000215; next: 000217

Authors: Szymon Grabowski [Poland]; Jakub Swacha [Poland]

Source:

Foundations of Computing and Decision Sciences [0867-6356]; 2012-12-01.

RBID : ISTEX:1A96B9FC68E740E8E445E06B684EE12892B17EDC

Abstract

One of the research fields significantly affected by the emergence of “big data” is computational linguistics. A prominent example of a large dataset targeting this domain is the collection of Google Books Ngrams, made freely available, for several languages, in July 2009. There are two problems with Google Books Ngrams: the textual format (compressed with Deflate) in which they are distributed is highly inefficient, and we are not aware of any tool facilitating search over those data, apart from the Google viewer, which, as a Web tool, has seriously limited use. In this paper we present a simple preprocessing scheme for Google Books Ngrams that also enables search for an arbitrary n-gram (i.e., its associated statistics) in an average time below 0.2 ms. The obtained compression ratio, with Deflate (zip) left as the backend coder, is over 3 times higher than in the original distribution.

URL:

https://api.istex.fr/document/1A96B9FC68E740E8E445E06B684EE12892B17EDC/fulltext/pdf

DOI: 10.2478/v10209-011-0015-8

Affiliations:

Poland

Links toward previous steps (curation, corpus...)

to stream Istex, to step Corpus: 002F20
to stream Istex, to step Curation: 002C89
to stream Istex, to step Checkpoint: 000006
to stream Main, to step Merge: 000220
to stream Main, to step Curation: 000216

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Google Books Ngrams Recompressed and Searchable</title>
<author><name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
</author>
<author><name sortKey="Swacha, Jakub" sort="Swacha, Jakub" uniqKey="Swacha J" first="Jakub" last="Swacha">Jakub Swacha</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:1A96B9FC68E740E8E445E06B684EE12892B17EDC</idno>
<date when="2012-12-22" year="2012">2012-12-22</date>
<idno type="doi">10.2478/v10209-011-0015-8</idno>
<idno type="url">https://api.istex.fr/document/1A96B9FC68E740E8E445E06B684EE12892B17EDC/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">002F20</idno>
<idno type="wicri:Area/Istex/Curation">002C89</idno>
<idno type="wicri:Area/Istex/Checkpoint">000006</idno>
<idno type="wicri:doubleKey">0867-6356:2012:Grabowski S:google:books:ngrams</idno>
<idno type="wicri:Area/Main/Merge">000220</idno>
<idno type="wicri:Area/Main/Curation">000216</idno>
<idno type="wicri:Area/Main/Exploration">000216</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Google Books Ngrams Recompressed and Searchable</title>
<author><name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
<affiliation wicri:level="1"><country xml:lang="fr">Pologne</country>
<wicri:regionArea>Lodz University of Technology, Institute of Applied Computer Science, al. Politechniki 11, 90-924 Łódź</wicri:regionArea>
<wicri:noRegion>90-924 Łódź</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Swacha, Jakub" sort="Swacha, Jakub" uniqKey="Swacha J" first="Jakub" last="Swacha">Jakub Swacha</name>
<affiliation wicri:level="1"><country xml:lang="fr">Pologne</country>
<wicri:regionArea>University of Szczecin, Institute of Information Technology in Management, Mickiewicza 64, 71-101 Szczecin</wicri:regionArea>
<wicri:noRegion>71-101 Szczecin</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Foundations of Computing and Decision Sciences</title>
<idno type="ISSN">0867-6356</idno>
<idno type="eISSN">2300-3405</idno>
<imprint><publisher>Versita</publisher>
<date type="published" when="2012-12-01">2012-12-01</date>
<biblScope unit="volume">37</biblScope>
<biblScope unit="issue">4</biblScope>
<biblScope unit="page" from="271">271</biblScope>
<biblScope unit="page" to="281">281</biblScope>
</imprint>
<idno type="ISSN">0867-6356</idno>
</series>
<idno type="istex">1A96B9FC68E740E8E445E06B684EE12892B17EDC</idno>
<idno type="DOI">10.2478/v10209-011-0015-8</idno>
<idno type="ArticleID">v10209-011-0015-8</idno>
<idno type="Related-article-Href">v10209-011-0015-8.pdf</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0867-6356</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">One of the research fields significantly affected by the emergence of “big data” is computational linguistics. A prominent example of a large dataset targeting this domain is the collection of Google Books Ngrams, made freely available, for several languages, in July 2009. There are two problems with Google Books Ngrams; the textual format (compressed with Deflate) in which they are distributed is highly inefficient; we are not aware of any tool facilitating search over those data, apart from the Google viewer, which, as a Web tool, has seriously limited use. In this paper we present a simple preprocessing scheme for Google Books Ngrams, enabling also search for an arbitrary n-gram (i.e., its associated statistics) in average time below 0.2 ms. The obtained compression ratio, with Deflate (zip) left as the backend coder, is over 3 times higher than in the original distribution.</div>
</front>
</TEI>
<affiliations><list><country><li>Pologne</li>
</country>
</list>
<tree><country name="Pologne"><noRegion><name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
</noRegion>
<name sortKey="Swacha, Jakub" sort="Swacha, Jakub" uniqKey="Swacha J" first="Jakub" last="Swacha">Jakub Swacha</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000216 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000216 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:1A96B9FC68E740E8E445E06B684EE12892B17EDC
   |texte=   Google Books Ngrams Recompressed and Searchable
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	Exploration server on OCR
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
